Logical structure detection for heterogeneous document classes

Authors

  • Leon Todoran
  • Marco Aiello
  • Christof Monz
  • Marcel Worring
Abstract

We present a fully implemented system based on generic document knowledge for detecting the logical structure of documents for which only general layout information is assumed. In particular, we focus on detecting the reading order. Our system integrates components based on computer vision, artificial intelligence, and natural language processing techniques. The prominent feature of our framework is its ability to handle documents from heterogeneous collections. The system has been evaluated on a standard collection of documents to measure the quality of the reading order detection. Experimental results for each component and the system as a whole are presented and discussed in detail. The performance of the system is promising, especially when considering the diversity of the document collection.

Keywords: Document Analysis, Logical Structure Detection, Reading Order Detection, Natural Language Processing, Spatial Reasoning.

1. INTRODUCTION

The goal of document analysis is to automatically process scanned documents and convert them into a digital format, which can, for example, be further processed for reproduction, digital libraries, information retrieval, and text-to-speech purposes. This process mainly consists of two steps: layout analysis and document understanding. During layout analysis, the constituents of the document image of a page are identified and classified as text or image objects; in addition, font information, textual content, geometric features, and spatial relations are extracted. This information is captured in the layout structure. Document understanding takes the layout structure as input, further classifies its items into logical items (e.g., title, paragraph, etc.) and detects relations between them (e.g., the reading order). This information is captured in the logical structure. Most of the document analysis systems developed so far make use of specific a priori document knowledge and are therefore domain-dependent.
The systems described in the literature and implemented in commercial software (e.g., FINEREADER [1]) can successfully handle simple black-and-white documents whose layout structure is known in advance. The treatment of colored documents, complex layouts, and heterogeneous classes of documents remains a challenging task and an open question, which we address in this paper. Because the data set is composed of a large collection of document classes, we detect the logical structure without using any class-specific information. In particular, we focus on reading order detection, which is a fundamental part of the logical structure. The reading order is the sequence of textual document objects in which the user is going to (or is supposed to) read the document at hand. To detect the reading order in a scanned document for which the layout structure is available, we introduce two components that take full advantage of the layout information. The first component is based on formal methods: a spatial reasoner, using a set of document rules, decides which reading orders are formally correct from the spatial point of view. The second component is based on natural language processing (NLP): it considers the text present in the textual document objects and identifies the syntactically most plausible reading orders.

Further author information: E-mail: {todoran,aiellom,christof,worring}@science.uva.nl; URL: http://www.science.uva.nl/~{todoran,aiellom,christof,worring}

In the last decade, several systems for detecting logical structures from scanned text have been developed. One example of a domain-specific system was developed by Tsujimoto and Asada to process multi-column black-and-white scientific papers [2]. In both layout and logical structure detection, domain knowledge is used to derive the classification rules. The main shortcoming of this system is that it cannot be adapted to other classes of documents.
Ishitani proposes to exchange information between layout and logical analysis, which are applied iteratively [3]. This improves both layout and logical detection. As in the previous case, however, this system can be used only for documents falling in one specific class. A step toward generality is made by Cesarini et al. [4], who identify two distinct categories of knowledge: knowledge specific to a class of documents, and generic knowledge that is independent of the class of documents. A pitfall is the use of XY-trees [5], which reduces the generality of documents to which their system is applicable. For instance, color documents in which objects overlap cannot be processed.

There have been attempts to automatically generate rules to detect the logical structure for a general class of documents. Sainz and Dimitriadis use a fuzzy-neural system to learn from a given training set the rules for converting the layout into the logical structure [6]. Li and Ng propose a domain-independent document understanding system with learning ability [7]. They use a directed weighted graph to represent the layout structure, allowing a more general class of documents to be considered than tree representations do. Both Sainz and Dimitriadis and Li and Ng use only geometrical information of the layout, whereas the content plays a key role in the extraction of the logical structure. For example, when no a priori document knowledge is given, detecting the reading order of the textual elements of a document can only be achieved by considering the textual content of the elements themselves, which in turn implies the use of natural language processing.

We have previously proposed a framework to extract the logical structure given the layout structure, making some use of the content of the document. Some very preliminary experimental results were presented [8]. In this paper, we extend the framework in two ways.
On the one hand, we provide some vertical integration, that is, we no longer assume that the logical object classification is available a priori. On the other hand, we use more effective spatial reasoning and natural language processing techniques.

The remainder of this paper is organized as follows. In the next section we describe the adopted representation for the layout and logical structure of a document. In Section 3, we present the architecture of our system and describe each of its components. Experimental results and evaluation are discussed in Section 4.

2. DOCUMENT REPRESENTATION

Document image analysis can be seen as the inverse process of document authoring. Therefore, these two processes should use similar document models. Because the document class is required to be generic, the document model should be able to represent any complex document structure. Rather than a tree-based representation, we consider a graph-based one to encode the relations among document objects. Our model is a flexible representation suitable for a broad class of documents.

A document $D$ is a pair of layout $G$ and logical $L$ structures: $D = \langle G, L \rangle$. As for the layout (or geometric) structure $G$ of a document $D$, let $A_g$ be a set of layout document objects and $B_g$ a set of geometric relations between the document objects, such that $G = \langle A_g, B_g \rangle$. In the current implementation of the system, the layout structure has three categories of layout objects: text, image, and graphics. Because we have to deal with generic documents, we consider a flexible list of features rather than a fixed set.
Besides the bounding box coordinates, the document object features considered are:

  • font size ratio, defined as the ratio between the font size of the current document object and the most common font size of the entire page;
  • aspect ratio, defined as the width divided by the height of the document object;
  • area ratio, defined as the ratio between the object's area and the page area;
  • content size, defined as the number of characters;
  • font style, with the possible values "Plain", "Bold" and "Italic".

The same holds for relations: rather than keeping one single relation, we have a list of relations. The spatial relations considered among document objects are adjacency and the product of Allen's interval relations on the two document axes: precedes, meets, overlaps, starts, during, finishes, equals (and their inverses) [9]. The adjacency relation is determined based on Voronoi diagrams [10].

The logical structure is defined analogously to the layout structure, as a set of logical document objects and a set of logical relations between them: $L = \langle A_l, B_l \rangle$.

3. FROM LAYOUT TO LOGICAL STRUCTURE

Assuming that the bounding boxes of textual document objects are given, our system extracts the logical structure from them, as shown in Figure 1. The first module (depicted at the bottom of the figure) assigns logical labels to the layout document objects $A_g$, thus creating the set of logical objects $A_l$. This process is described in detail in Section 3.1. For each type of object in $A_l$ the reading order is extracted independently, as presented in Section 3.2.1. These per-type reading orders are then combined into a set of admissible reading orders; a zoom-in of the spatial reasoning module is shown on the right of the figure. The set of admissible reading orders is further reduced using the natural language processing module described in Section 3.2.2.
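The object features listed in Section 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `LayoutObject`, its field names, and the helper functions are hypothetical, and taking the "most common font size" as the mode over objects (rather than, say, over characters) is an assumption.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LayoutObject:
    """A textual layout object (hypothetical field names)."""
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float
    text: str
    font_style: str = "Plain"  # "Plain", "Bold" or "Italic"


def most_common_font_size(objects):
    """Mode of the font sizes on a page (assumed: counted per object)."""
    return Counter(o.font_size for o in objects).most_common(1)[0][0]


def features(obj, page_width, page_height, common_font_size):
    """Compute the features of Section 2 as an open-ended dictionary."""
    width = obj.x1 - obj.x0
    height = obj.y1 - obj.y0
    return {
        "font_size_ratio": obj.font_size / common_font_size,
        "aspect_ratio": width / height,
        "area_ratio": (width * height) / (page_width * page_height),
        "content_size": len(obj.text),
        "font_style": obj.font_style,
    }
```

A dictionary is used here so that further features can be added without changing the object type, in the spirit of the "flexible list of features" described in the text.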
[Figure 1: architecture of the system, with the logical labeling module at the bottom and a zoom-in of the spatial reasoning module on the right; image not recoverable.]
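The spatial relations of Section 2, the product of Allen's relations on the two document axes, can be sketched as follows. This is an illustrative Python sketch, not the authors' spatial reasoner: the function names, the `_inverse` suffix, and the `(x0, y0, x1, y1)` box convention are assumptions, and intervals are assumed to have strictly positive length.

```python
def allen(a, b):
    """Allen relation of interval a = (a0, a1) with respect to b = (b0, b1)."""
    (a0, a1), (b0, b1) = a, b
    if a0 >= a1 or b0 >= b1:
        raise ValueError("intervals must have positive length")
    if a1 < b0:
        return "precedes"
    if a1 == b0:
        return "meets"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0 and a1 < b1:
        return "starts"
    if a1 == b1 and a0 > b0:
        return "finishes"
    if a0 > b0 and a1 < b1:
        return "during"
    if a0 < b0 < a1 < b1:
        return "overlaps"
    # None of the seven base relations holds, so one of them
    # holds with the arguments swapped: report it as an inverse.
    return allen(b, a) + "_inverse"


def box_relation(a, b):
    """Product of Allen relations on the x and y axes.

    Boxes are (x0, y0, x1, y1) tuples (assumed convention).
    """
    x_rel = allen((a[0], a[2]), (b[0], b[2]))
    y_rel = allen((a[1], a[3]), (b[1], b[3]))
    return (x_rel, y_rel)
```

For two side-by-side columns of equal height, for example, `box_relation` yields a pair such as `("precedes", "equals")`, which a rule-based reasoner could use to decide that the left column may be read before the right one.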


Similar resources

Document image analysis with cooperative interaction between layout analysis and logical structure analysis

When a printed document is to be input to a computer system, the document must be converted to a computer-readable format, e.g., ASCII, PDF, RTF, CSV, or SGML/XML/HTML-tagged data. In order to obtain these data formats from a printed document, it is necessary to extract from the printed document as much information as possible, i.e., layout structure (layout objects and their hierarchical relat...


Decoupled Signal Detection for the Uplink of Large-Scale MIMO Systems in Heterogeneous Networks

Massive multiple-input multiple-output (MIMO) systems are strong candidates for future fifth generation (5G) heterogeneous cellular networks. For 5G, a network densification with a high number of different classes of users and data service requirements is expected. Such a large number of connected devices needs to be separated in order to allow the detection of the transmitted signals according...


Decoupled signal detection for the uplink of massive MIMO in 5G heterogeneous networks

Massive multiple-input multiple-output (MIMO) systems are strong candidates for future fifth-generation (5G) heterogeneous cellular networks. For 5G, a network densification with a high number of different classes of users and data service requirements is expected. Such a large number of connected devices needs to be separated in order to allow the detection of the transmitted signals according...


A Strategy for Retrospective Conversion of Documents

This paper proposes a strategy for retrospective conversion of documents. This strategy consists in an interpretation cycle where document analysis and document understanding interact. This cycle is initialized by the extraction of the outline of the layout and logical structures of the document. Then, each iteration of the cycle consists in the detection of inconsistencies in the document mode...


Learning Document Image Features With SqueezeNet Convolutional Neural Network

The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state of the art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...



Publication date: 2001